less ecoli-contigs.fa
less ecoli-scaffolds.fa
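Use the “awk” command to print the length of the longest scaffold in the scaffold file.
Header lines (those starting with “>”) are skipped, and each sequence is assumed to
occupy a single line.
awk '!/^>/ {print length}' ecoli-scaffolds.fa | sort -n | tail -n1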
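If the sequences in a FASTA file are wrapped across several lines, the length of a single line no longer equals the length of a scaffold. The following is a minimal sketch that sums the bases of each record before comparing; the function name longest_seq and the demo sequences are illustrative assumptions, not part of the assembler output.

```shell
# Hypothetical helper: print the name and length of the longest sequence
# in a FASTA file, summing bases across wrapped sequence lines.
longest_seq() {
  awk '
    /^>/ {                          # header line: close the previous record
      if (len > max) { max = len; best = name }
      name = substr($0, 2)          # record name without the leading ">"
      len = 0
      next
    }
    { len += length($0) }           # sequence line: accumulate its length
    END {
      if (len > max) { max = len; best = name }
      print best "\t" max
    }
  ' "$1"
}

# Tiny demo file with made-up sequences; record s1 is wrapped over two lines.
demo=$(mktemp)
printf '>s1\nACGT\nAC\n>s2\nACG\n' > "$demo"
longest_seq "$demo"
```

On the scaffold file, the same helper would be called as longest_seq ecoli-scaffolds.fa.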
3.2.2 SPAdes
SPAdes [9] is a de novo genome assembler developed primarily for assembling the small
genomes of bacteria. Modules were later added for assembling small genomes of other
organisms, including fungi and viruses. It is not recommended for assembling large
mammalian genomes. The current SPAdes version works with both Illumina and Ion Torrent
reads, and it can also perform hybrid genome assembly with PacBio, Oxford Nanopore, and
Sanger reads. The assembler can process several paired-end and mate-pair libraries at the
same time. The program also provides separate modules for metagenomic data, plasmid
assembly from whole genome sequencing data, plasmid assembly from metagenomic data,
transcriptome assembly from RNA-Seq data, biosynthetic gene cluster assembly with
paired-end reads, viral genome assembly from RNA viral data, SARS-CoV-2 assembly, and
TruSeq barcode assembly. The SPAdes assembly process includes four stages. First, de Bruijn
graphs are built from overlapping k-mers generated from the reads. Second, pairs of k-mers
(k-bimers) are adjusted to obtain accurate distance estimates between k-mers, using both
distance histograms and paths in the assembly graphs. Third, the program constructs paired
de Bruijn graphs, a generalization of the de Bruijn graph that incorporates mate-pair
information into the graph structure [10]. Finally, contigs are constructed from the graphs.
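To make the first stage concrete, the following is a minimal sketch of how de Bruijn graph edges can be derived from the k-mers of a set of reads. It is not the SPAdes implementation: the function name debruijn_edges, the one-read-per-line input, and the toy reads are assumptions for illustration, and reverse complements and sequencing errors are ignored.

```shell
# Hypothetical helper: read one sequence per line from standard input and
# list de Bruijn graph edges. Nodes are (k-1)-mers; every k-mer adds an
# edge from its (k-1)-base prefix to its (k-1)-base suffix.
debruijn_edges() {
  k="$1"
  awk -v k="$k" '
    {
      for (i = 1; i + k - 1 <= length($0); i++) {
        kmer = substr($0, i, k)
        edge = substr(kmer, 1, k - 1) " -> " substr(kmer, 2)
        count[edge]++               # multiplicity of the edge across reads
      }
    }
    END { for (e in count) print e "\t" count[e] }
  ' | sort
}

# Two made-up overlapping reads, k = 3
printf 'ACGTAC\nCGTACG\n' | debruijn_edges 3
```

Each output line shows one edge and the number of times its k-mer was seen in the reads. Walking along such edges is what later allows overlapping reads to be merged into contigs.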
The SPAdes program is made up of modules in Python. The installation instructions are
available at “https://cab.spbu.ru/files/release3.15.4/manual.html”. To install the program
on Linux, use the following steps:
Using the Linux terminal, first download and decompress the precompiled SPAdes Linux
package in a local directory.
wget https://cab.spbu.ru/files/release3.15.4/SPAdes-3.15.4-Linux.tar.gz
tar -xzf SPAdes-3.15.4-Linux.tar.gz
Notice that the program name or path may change in the future.
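Once the archive is unpacked, a quick check can confirm that the assembler runs. The sketch below assumes that the archive was unpacked into SPAdes-3.15.4-Linux in the current directory and that the launcher sits at bin/spades.py; SPADES_BIN is an illustrative variable name. The --test option runs SPAdes on the small test data set shipped with the program.

```shell
# Assumed location of the unpacked binaries; adjust if the path differs.
SPADES_BIN="SPAdes-3.15.4-Linux/bin"

if [ -x "$SPADES_BIN/spades.py" ]; then
  "$SPADES_BIN/spades.py" --version   # print the installed version
  "$SPADES_BIN/spades.py" --test      # run the bundled toy data set
else
  echo "spades.py not found in $SPADES_BIN" >&2
fi
```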
FIGURE 3.7 Genome assembly metrics.